Introduction


The long-range goal of this project is the understanding of human auditory
processing of information conveyed by complex, time-varying signals such as speech,
music or important environmental sounds. Our work is guided by the assumption that
human auditory communication is a “modulation-demodulation” process. That is, we
assume that sound sources produce a complex stream of sound pressure waves with
information encoded as variations (modulations) of the signal amplitude and
frequency. The listener’s task then is one of demodulation. Much past
psychoacoustics work has been based on what we characterize as “spectrum picture
processing.” Complex sounds are Fourier analyzed to produce an amplitude-by-frequency
“picture,” and the perception process is modeled as if the listener were
analyzing the spectral picture. This approach leads to studies such as “profile
analysis” and the power-spectrum model of masking. Our approach leads us to
investigate time-varying, complex sounds. We refer to them as dynamic signals, and
we have developed auditory signal processing models to help guide our experimental
work.

Since the proposal for this project was written in fall 1992, we have re-ordered
the sequence of experiments that were proposed. To make it easier to relate our work
to that document, however, progress is described under the headings of the proposal.
Also, because the project started on June 1 rather than January 1, 1993, some tasks
were completed very early in the first year of funding. Because they were not included in
the final report of the previous funding period, they are included in this report.

The TDT equipment purchased in the first year of the project enabled us to
generate the complex, time-varying signals in real time. Previously, some signals had
to be generated off-line and stored on disk for replay during the experiment. With
real-time generation we can run “roving-parameter” paradigms in
adaptive tracking procedures. Parameters that can “rove” include signal frequency,
duration and amplitude. Roving can be done on a single parameter or on
combinations (e.g., roving frequency and amplitude at the same time).


A.  Single-transition  signals  -  single  channel  model 

1. Frequency Modulated Tones

a.  Roving  frequency:  glide-step  Discrimination  (see  paragraph  below) 

b.  Sinusoidal  vs.  Linear  Trajectory 

Work began in January 94 on the detection of sinusoidal FM added to a linear
FM sweep. To discern the effects of roving frequency on these tasks we incorporated
frequency rove into the design of this set of experiments rather than conducting a step
vs. glide experiment with roving frequency. Our results have been reported at the
June 94 meeting of the Acoustical Society and at the 10th International Symposium on
Hearing at Irsee, Bavaria. A manuscript is in progress that will be submitted to the
Acoustical Society journal.

c.  Slope  Discrimination 

This area was the topic of the doctoral dissertation of Chien yeh Hsu. His
dissertation was completed in the summer quarter 1993, and he reports that a
manuscript is in progress. Since July 1993 he has held a post-doctoral position at the
University of Illinois.

2.  Moving  Filter 

a.  Variation  on  the  glide-step  Discrimination  task 

We have bypassed these proposed follow-up versions of the original design
because we decided that most of the questions raised could be answered by
incorporating the variations into sinusoidal plus linear FM designs.

b.  Sinusoidal  vs.  Linear  Trajectory 

These  experiments  have  not  been  started.  We  expect  that  they  will  be 
underway  in  the  second  year  of  the  project. 

c.  Slope  Discrimination 

The discrimination of the slope of the linear trajectory for a single resonator
filter was incorporated into the dissertation of Hsu (see 1.c above). Results were
reported at the 1994 meeting of the ARO and we expect a manuscript to be submitted
soon.


3.  Single  Formants  from  “Real  Speech” 

This work has not been started. We expect that it will begin at the end of year
two or the beginning of year three.

B.  Multi-formant  signals  -  multi-channel  model 

Work on the multi-channel IWAIF model was incorporated into the master’s
thesis of M. Mokheimer, who applied it to the detection of mixed modulation by human
listeners. A presentation based on the thesis is to be given in Cairo in Dec. 1994.
Development of the model has continued in year one, with the experimental work on
moving filter and “real speech” signals to follow in years two and three.

1. Moving filter multiple-formant signals (see paragraph above)

2.  Multi-formant  signals  extracted  from  real  speech  (see  paragraph  above) 

C.  Incorporation  of  Envelope  Cues 

We  have  conducted  a  series  of  experiments  suggested  by  the  work  of  Versfeld 
and  Houtsma,  and  presented  preliminary  results  at  the  June  94  meeting  of  the 
Acoustical  Society.  As  is  often  the  case,  the  experiments  have  raised  more  questions 
than  they  answered,  and  we  continue  to  work  on  this  area. 
Intensity-weighted average of instantaneous frequency as a model
for frequency discrimination
The intensity-weighted average of instantaneous frequency (IWAIF) is developed as a model to
predict listener performance in tasks primarily requiring frequency discrimination. IWAIF is
closely related to the envelope-weighted average of instantaneous frequency (EWAIF) model
proposed by Feth for similar tasks. The primary difference is that the IWAIF model uses
intensity (envelope squared) as the weighting function instead of the envelope. The advantages
of IWAIF over EWAIF are that (a) it has a convenient frequency domain interpretation; and
(b) it is much simpler to compute than the EWAIF. The IWAIF is the “center of gravity” of
the energy spectral density function of the signal.

PACS  numbers:  43.66.Ba,  43.66.Fe  [HSC] 
INTRODUCTION

The envelope-weighted average of instantaneous frequency (EWAIF) model was
developed nearly two decades ago by Feth (1974) to account for the discriminability
of two-tone complexes. Helmholtz (1954) reported that the pitch of a two-component
complex tone is shifted toward the frequency of the component whose amplitude is
increased slightly. Helmholtz attributed the pitch shift to fluctuations in the
instantaneous frequency of the two-tone complex. Feth and co-workers (Feth, 1974;
Feth and O'Malley, 1977; Feth et al., 1982) have studied the discriminability of
complementary pairs of two-tone complexes (Voelcker, 1966a,b). Feth showed that the
pitch differences are proportional to the EWAIF differences between the complex
signals. That is, the EWAIF is calculated for each signal of the pair. The
discriminability is predicted as that of pure tones with frequencies at the EWAIF
values. The EWAIF model has been used to explain a variety of discrimination tasks
where the spectral pitch of the stimulus is the dominant cue. Feth and Stover (1987)
extended the model to explain an anomaly in data relating to “profile signals”
(Green, 1988). The central theme of this model is that for certain signal pairs,
listeners use spectral pitch differences to discriminate between them. Feth's model
attempts to quantify the pitch changes observable in the discrimination of complex
stimuli. In the case of profile signals, it is assumed that changes in spectral
shape of the profile signals produce a noticeable change in perceived pitch.

This paper introduces a related model, the intensity-weighted average of
instantaneous frequency, IWAIF. The main difference between the IWAIF of a signal
and its EWAIF is the choice of a weighting function. While the envelope is used to
weight the instantaneous frequency for the EWAIF calculation, the intensity, which
is proportional to envelope squared, is used as the weighting function for the
IWAIF calculation. Since both envelope and intensity are non-negative, these
weighting functions are highly correlated. Thus, similar values are expected for
the EWAIF and IWAIF of the same signal. Indeed, Feth et al. (1982) demonstrated
that the choice between envelope and envelope-squared weights, as well as between
rms and arithmetic averaging, made little difference in the weighted-frequency
average calculations.

The advantage of the IWAIF lies in computational efficiency and accuracy.
Analytical calculation of the EWAIF requires a bit of algebra and trigonometry to
derive expressions for the envelope and the instantaneous frequency. For two
components (Feth, 1974) the analytical solution is straightforward; for three
components Kidd et al. (1991) have produced an analytic solution. For more than
three components, the analytic approach is daunting.

Discrete approximations to the EWAIF calculation present a new set of problems.
The instantaneous frequency must be calculated by taking the derivative of
instantaneous phase, a highly noise-sensitive process. Also, the EWAIF may require
the division of two near-zero quantities, which may lead to underflow errors in
finite word length representations of the values. The IWAIF formulation can be
transformed into the frequency domain (Anantharaman et al., 1991). In addition to
avoiding the calculation problems of the time-domain version of EWAIF, the IWAIF
provides both computational efficiency and a novel interpretation of its value. The
IWAIF is equivalent to the spectral “center of gravity.”

First, the time and frequency domain representations of the EWAIF are presented.
The IWAIF of a signal is then defined, and its representation in the frequency
domain is derived. The performance of the IWAIF model is then compared to that of
the EWAIF model in a number of psychoacoustic tasks.
I. EWAIF MODEL

A. EWAIF in the time domain

In general, a finite energy real signal $s(t)$ which has a Fourier transform

$$S(f) = \mathcal{F}[s(t)] = \int_{-\infty}^{\infty} s(t)\, e^{-j2\pi f t}\, dt \tag{1}$$

can be represented as (McGillem, 1979; Voelcker, 1966a,b)

$$s(t) = e(t)\cos\phi(t), \quad 0 < t < T, \tag{2}$$

$$\phantom{s(t)} = \mathrm{Re}\left[e(t)\, e^{j\phi(t)}\right], \tag{3}$$

where $e(t)$ is the instantaneous envelope, $\phi(t)$ is the instantaneous phase,
and Re[ ] denotes the real part operator. The instantaneous frequency, $f(t)$, is
defined as

$$f(t) = \frac{1}{2\pi}\frac{d\phi(t)}{dt}. \tag{4}$$


Such a representation of $s(t)$ is not unique. For example, $e(t)$ can be chosen
to satisfy (3) for an arbitrary $\phi(t)$. A unique $e(t)$ and $\phi(t)$ can be
assured by imposing an additional constraint, namely, that the real and imaginary
parts of the complex signal $e(t)e^{j\phi(t)}$ form a Hilbert transform pair. Such
a complex signal is termed analytic and has certain useful properties. Thus, the
analytic signal corresponding to the real signal $s(t)$ can be written as

$$m(t) = s(t) + j\hat{s}(t), \tag{5}$$

$$\phantom{m(t)} = e(t)\, e^{j\phi(t)}, \tag{6}$$

where

$$\hat{s}(t) = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{s(\tau)}{t-\tau}\, d\tau \tag{7}$$

is the Hilbert transform of $s(t)$.

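The envelope and instantaneous frequency functions, $e(t)$ and $f(t)$, can be
defined in terms of $s(t)$ and $\hat{s}(t)$ as

$$e(t) = |m(t)| = \left[s^2(t) + \hat{s}^2(t)\right]^{1/2}, \tag{8}$$

$$\phi(t) = \arctan\left(\frac{\hat{s}(t)}{s(t)}\right), \tag{9}$$

$$f(t) = \frac{1}{2\pi}\,\frac{s(t)\,\hat{s}'(t) - s'(t)\,\hat{s}(t)}{s^2(t) + \hat{s}^2(t)}. \tag{10}$$

The envelope-weighted average of instantaneous frequency (EWAIF) of $s(t)$ is
defined as

$$\mathrm{EWAIF}[s(t)] = \frac{\int_0^T e(t)\, f(t)\, dt}{\int_0^T e(t)\, dt}. \tag{11}$$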


A common method of calculating the EWAIF of a signal is to determine the
envelope and instantaneous frequency functions using (8) and (10), and then to
compute the required integrals in (11). However, there are some computational
problems when we adopt this method for calculating the EWAIF of broadband signals.
Note that the expression (10) for $f(t)$ involves differentiation, which is a
highly noise-sensitive operation.
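As a rough illustration of the time-domain route through Eqs. (8), (9), (4), and (11), the following Python sketch computes the EWAIF of a sampled signal. It is a sketch under stated assumptions, not the authors' implementation: the function names, the FFT-based construction of the analytic signal, and the finite-difference derivative (`np.gradient`) are choices made for this example, and NumPy is assumed to be available.

```python
import numpy as np


def analytic_signal(s):
    """Return m = s + j*Hilbert(s), built by zeroing the negative-frequency bins."""
    n = len(s)
    spectrum = np.fft.fft(s)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin once
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


def ewaif_time_domain(s, fs):
    """Envelope-weighted average of instantaneous frequency, in Hz."""
    m = analytic_signal(np.asarray(s, dtype=float))
    envelope = np.abs(m)                                  # Eq. (8)
    phase = np.unwrap(np.angle(m))                        # Eq. (9), unwrapped
    inst_freq = np.gradient(phase) * fs / (2.0 * np.pi)   # Eq. (4), finite differences
    # Eq. (11): the sampling interval cancels between numerator and denominator.
    return np.sum(envelope * inst_freq) / np.sum(envelope)


if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    two_tone = np.cos(2 * np.pi * 1000 * t) + 0.5 * np.cos(2 * np.pi * 1100 * t)
    print(ewaif_time_domain(two_tone, fs))
```

The phase derivative and the behavior near envelope zeros are exactly the steps the text identifies as fragile; this sketch does nothing to protect against them.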
B. Frequency domain representation of EWAIF

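Alternatively, $f(t)$ can be expressed in terms of the analytic signal $m(t)$
alone by rewriting (6) as

$$\ln m(t) = \ln|m(t)| + j\phi(t). \tag{12}$$

Hence,

$$\phi(t) = \mathrm{Im}\left[\ln m(t)\right], \tag{13}$$

where Im denotes the imaginary part operator, and

$$f(t) = \frac{1}{2\pi}\,\mathrm{Im}\left[\frac{m'(t)}{m(t)}\right]. \tag{14}$$

Inserting the above equations in the expression for EWAIF (11) we have

$$\mathrm{EWAIF}[s(t)] = \frac{1}{2\pi}\,\frac{\int_0^T |m(t)|\,\mathrm{Im}\left[m'(t)/m(t)\right] dt}{\int_0^T |m(t)|\, dt}. \tag{15}$$

This can be expressed in terms of the Fourier transform of $\sqrt{m(t)}$ as (see
Appendix A for the derivation)

$$\mathrm{EWAIF}[s(t)] = 2\,\frac{\int_{-\infty}^{\infty} f\,|M_s(f)|^2\, df}{\int_{-\infty}^{\infty} |M_s(f)|^2\, df}, \tag{16}$$

where $M_s(f) = \mathcal{F}\left[\sqrt{m(t)}\right]$.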

The EWAIF is thus the frequency of the “center of gravity” of $|M_s(f)|^2$.
While this is an interesting observation, it is of little use in the computation of
the EWAIF of a signal. Indeed, in order to obtain $M_s(f)$, the square root of a
complex signal has to be computed. In computing $\sqrt{m(t)}$, we have to be
careful to choose the principal branch of the square root. This is similar to the
phase unwrapping problem encountered in signal processing. Further, because $f(t)$
has an $e(t)$ term in the denominator, care must be taken in computing the
instantaneous frequency at points where the envelope is zero or near zero. This
involves computing a limit of the ratio of two functions which approach zero rather
than a simple division.

II.  IWAIF 

In computing the EWAIF, the envelope of the signal is used as the weighting
function for finding the average of the instantaneous frequency. Other weighting
functions may model listener discriminability as well. Indeed, predictions based on
an envelope-squared (intensity) weighted model have performed as well as an
envelope weighted model (Feth et al., 1982). This is to be expected as intensity is
the square of the envelope, a non-negative function. To this end, let us
investigate the intensity-weighted (arithmetic) average of instantaneous frequency
(IWAIF) of a signal. The IWAIF of $s(t)$ is defined as

$$\mathrm{IWAIF}[s(t)] = \frac{\int_0^T e^2(t)\, f(t)\, dt}{\int_0^T e^2(t)\, dt}, \tag{17}$$

where $e(t)$ and $f(t)$ are as defined in Eqs. (8) and (10), respectively.

The above definition of IWAIF was motivated by our previous work with EWAIF.
Equation (17) can also be arrived at in an alternative way. Suppose that a suitable
frequency $f_0$ is to be found such that $s(t)$ represents a modulated wave of the
form

$$s(t) = e(t)\cos\left[2\pi f_0 t + \theta(t)\right], \tag{18}$$

$$\phantom{s(t)} = \mathrm{Re}\left[m(t)\right]. \tag{19}$$

$e(t)$ is thought of as the envelope of $s(t)$ and $\theta(t)$ as its phase.
$m(t)$ is the complex analytic signal corresponding to $s(t)$ as given in Eq. (5).
For narrow band $e(t)$ and $\theta(t)$ this represents the modulation of a
sinusoidal carrier wave of frequency $f_0$. The instantaneous frequency of the
signal is

$$f(t) = f_0 + \frac{1}{2\pi}\frac{d\theta(t)}{dt}. \tag{20}$$

The choice of $f_0$ can be arbitrary so long as the mathematical relations remain
valid. The most common choice (McGillem, 1979) is to select $f_0$ such that it is
the center of gravity of $|M(f)|^2$. This corresponds to the center of gravity of
the positive frequency portion of the energy density spectrum of the signal. The
required value for $f_0$ is that value which minimizes the following integral:

$$\int_{-\infty}^{\infty} (f - f_0)^2\, |M(f)|^2\, df, \tag{21}$$

which is the same as the IWAIF of the signal $s(t)$.
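The step from the minimization in (21) to a center of gravity can be written out explicitly. The short derivation below is added as a sketch for clarity; it assumes only that $|M(f)|^2$ is integrable with finite first and second moments. Setting the derivative of (21) with respect to $f_0$ to zero gives

$$\frac{d}{df_0}\int_{-\infty}^{\infty}(f-f_0)^2\,|M(f)|^2\,df = -2\int_{-\infty}^{\infty}(f-f_0)\,|M(f)|^2\,df = 0
\quad\Longrightarrow\quad
f_0 = \frac{\int_{-\infty}^{\infty} f\,|M(f)|^2\,df}{\int_{-\infty}^{\infty}|M(f)|^2\,df}.$$

The second derivative, $2\int_{-\infty}^{\infty}|M(f)|^2\,df$, is positive for any nonzero signal, so this stationary point is a minimum. The resulting $f_0$ is the centroid of $|M(f)|^2$, which is the frequency-domain form of the IWAIF derived in the next subsection.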

A. Frequency domain representation of IWAIF

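Much of the discussion in this section follows that in Anantharaman (1992). We
can rewrite (17) in terms of $m(t)$ as

$$\mathrm{IWAIF}[s(t)] = \frac{1}{2\pi}\,\frac{\int_0^T |m(t)|^2\,\mathrm{Im}\left[m'(t)/m(t)\right] dt}{\int_0^T |m(t)|^2\, dt}. \tag{22}$$

Invoking Parseval’s relation this becomes (see Appendix B)

$$\mathrm{IWAIF}[s(t)] = \frac{\int_{-\infty}^{\infty} f\,|M(f)|^2\, df}{\int_{-\infty}^{\infty} |M(f)|^2\, df}. \tag{23}$$

This can be further simplified by taking advantage of the one-sided nature of
$M(f)$ and its relation to $S(f)$. Equation (23) then becomes

$$\mathrm{IWAIF}[s(t)] = \frac{\int_0^{\infty} f\,|S(f)|^2\, df}{\int_0^{\infty} |S(f)|^2\, df}. \tag{24}$$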

Thus, the IWAIF of a real signal is located exactly at the “center of gravity”
of the positive portion of its energy density spectrum. The above frequency domain
representation (24) provides a simple and efficient procedure for computing the
IWAIF of a signal. Compare this with (16), which is the frequency domain expression
for the EWAIF. Using (24) eliminates most of the difficulties encountered in
computing the EWAIF of the signal. The IWAIF is completely described by the energy
spectrum of the signal alone. This obviates the need to compute a Hilbert transform
and a derivative. All that needs to be computed is the Fourier transform of $s(t)$.
This can be done efficiently using the FFT algorithm.

Suppose $s(t)$ is sampled at a rate $F_s$ to yield $N$ samples, $s[n]$,
$n = 0, 1, \ldots, N-1$, and its $N$-point FFT is $S[k]$, $k = 0, 1, \ldots, N-1$.
Then, the IWAIF of $s(t)$ can be computed as

$$\mathrm{IWAIF}[s(t)] = \frac{\sum_{k=0}^{N/2-1} k\,\Delta f\,|S[k]|^2}{\sum_{k=0}^{N/2-1} |S[k]|^2}, \tag{25}$$

where $\Delta f = F_s/N$ is the frequency spacing between samples of the FFT.
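The discrete formula (25) translates almost line for line into code. The sketch below assumes NumPy; the function name `iwaif_fft` is introduced here for illustration only. It uses the real-input FFT and keeps bins 0 through N/2 - 1, the positive-frequency half of the spectrum.

```python
import numpy as np


def iwaif_fft(s, fs):
    """Intensity-weighted average of instantaneous frequency, in Hz.

    Computed as the center of gravity of the positive-frequency
    energy spectrum, following Eq. (25).
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    spectrum = np.fft.rfft(s)[: n // 2]      # bins k = 0 .. N/2 - 1
    energy = np.abs(spectrum) ** 2           # |S[k]|^2
    freqs = np.arange(len(spectrum)) * fs / n  # k * delta_f, with delta_f = fs / N
    return np.sum(freqs * energy) / np.sum(energy)


if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    two_tone = np.cos(2 * np.pi * 1000 * t) + 0.5 * np.cos(2 * np.pi * 1100 * t)
    print(iwaif_fft(two_tone, fs))
```

In this example the signal holds a whole number of periods of each tone, so every component falls exactly on an FFT bin. For other durations, spectral leakage from the implicit rectangular window will bias the estimate somewhat.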


III. COMPARISON OF IWAIF AND EWAIF
PREDICTIONS WITH PSYCHOACOUSTIC RESULTS

As mentioned earlier, the main difference between the IWAIF of a signal and its
EWAIF is in the choice of a weighting function. While the envelope is used to
weight the instantaneous frequency in calculating the EWAIF, the intensity
(envelope squared) is used as the weighting function in IWAIF calculations. Since
both envelope and intensity are non-negative and the latter is the square of the
former, the weighting functions are highly correlated. Thus, similar values are
expected for the EWAIF and IWAIF of a signal. For a simple sinusoid, both the EWAIF
and the IWAIF values are equal to the tone frequency $f_0$. For a combination of
two tones of the same amplitude the EWAIF and IWAIF values are again equal and are
located at the mean of the two frequencies. It is difficult to calculate the EWAIF
of a combination of three or more tones analytically. However, the IWAIF of an
$N$-component complex can be easily calculated. Assuming $T$ to be much larger than
the maximum of all the tone periods, the IWAIF of a sum of sinusoids such as

$$s(t) = \sum_i a_i \cos(2\pi f_i t), \quad 0 < t < T \tag{26}$$

is approximately equal to the weighted mean

$$\mathrm{IWAIF}[s(t)] = \frac{\sum_i a_i^2 f_i}{\sum_i a_i^2}. \tag{27}$$

The above relation would have been exact had the sinusoids extended in time from
$-\infty$ to $+\infty$. For the finite duration signals being considered here the
approximation gets better as the duration $T$ increases.
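Equation (27) needs only the component amplitudes and frequencies, so it can be evaluated without synthesizing a waveform. The helper below is a minimal sketch; its name and argument order are assumptions made for this example.

```python
def iwaif_tone_complex(amplitudes, frequencies):
    """Approximate IWAIF (Hz) of a sum of sinusoids, following Eq. (27).

    Each frequency is weighted by the squared amplitude (intensity)
    of its component.
    """
    if len(amplitudes) != len(frequencies):
        raise ValueError("amplitudes and frequencies must have the same length")
    weighted_sum = sum(a * a * f for a, f in zip(amplitudes, frequencies))
    total_intensity = sum(a * a for a in amplitudes)
    return weighted_sum / total_intensity


if __name__ == "__main__":
    # Two equal-amplitude tones: the IWAIF is the mean of the two frequencies.
    print(iwaif_tone_complex([1.0, 1.0], [1000.0, 1100.0]))
    # Halving the upper component's amplitude pulls the IWAIF toward 1000 Hz.
    print(iwaif_tone_complex([1.0, 0.5], [1000.0, 1100.0]))
```

For the second case the intensities are 1 and 0.25, so the weighted mean is (1000 + 0.25 × 1100) / 1.25 = 1020 Hz.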

To illustrate the comparable predictions of EWAIF and IWAIF models, we present
model predictions for normal-hearing listener performance in frequency
discrimination and spectral pitch matching experiments reported previously.


A. Two-component, common envelope complex tones


Feth et al. (1982) asked four well-practiced listeners to distinguish between
pairs of common envelope, complex tones (see Voelcker, 1966a,b for a discussion of
common envelope signals). In addition, the listeners were required to match the
pitch they heard for each signal in the pair to that in a two-component signal with
equal amplitude components. The equal-amplitude signal was adjustable along the
frequency axis to enable the spectral pitch match. In that study, Feth et al.
presented both predicted pitch values, based on EWAIF calculations, and the
averaged pitch match for each signal. They assumed that the EWAIF for each of the
signals of a given pair represented the frequency of a sinusoid that would produce
the same spectral pitch. The difference between EWAIF values was assigned a
predicted percentage of correct discriminations. Feth (1974) had found that such
predictions were good indicators of listener performance for discrimination between
pairs of common envelope signals. In Table I, we present selected discrimination
results from Feth et al. (1982) along with EWAIF and IWAIF model predictions. As
previously noted by Feth et al., differences in IWAIF values for a given signal
pair tend to be slightly smaller than the equivalent EWAIF values, but the
differences are considered negligible.

TABLE I. EWAIF and IWAIF values for complementary Voelcker signal pairs differing
in frequency by Δf Hz and in intensity by ΔI dB. The average P(C) values for four
subjects at a center frequency of 2000 Hz are also shown. Predicted differences are
independent of the center frequency. Note that an increase in P(C) results in a
corresponding increase in ΔEWAIF or ΔIWAIF, which is just as expected from the
frequency discrimination data for pure tones.

                                   Predicted pitch differences
Δf (Hz)   ΔI (dB)   Average P(C)   ΔEWAIF (Hz)   ΔIWAIF (Hz)
10        0.5       63.3%          [illegible]   0.6
          1.0       69.0%          1.4           1.2
          3.0       83.0%          3.7           3.3
20        0.5       [illegible]    1.6           1.2
          1.0       71.0%          2.9           2.3
          3.0       [illegible]    7.3           6.7
50        0.5       63.8%          4.1           2.9
          1.0       77.3%          7.2           5.8
          3.0       87.3%          18.2          16.7
100       0.5       67.8%          8.1           5.8
          1.0       77.8%          14.3          11.5
          3.0       96.0%          36.3          33.2
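The ΔIWAIF column of Table I can be approximated directly from the weighted-mean relation (27). The sketch below assumes that the two members of a complementary pair contain the same two components with their levels interchanged, and that ΔI is the level difference between the two components; the function name is hypothetical. Under those assumptions the predicted difference is Δf (r - 1)/(r + 1), where r is the intensity ratio corresponding to ΔI.

```python
def delta_iwaif(delta_f_hz, delta_i_db):
    """IWAIF difference (Hz) between the members of a complementary two-tone pair.

    Assumes both signals contain tones at f and f + delta_f, with the
    stronger and weaker levels swapped between the two signals.
    """
    ratio = 10.0 ** (delta_i_db / 10.0)   # intensity ratio, strong over weak
    return delta_f_hz * (ratio - 1.0) / (ratio + 1.0)


if __name__ == "__main__":
    for delta_f in (10, 20, 50, 100):
        for delta_i in (0.5, 1.0, 3.0):
            print(delta_f, delta_i, round(delta_iwaif(delta_f, delta_i), 2))
```

For Δf = 100 Hz and ΔI = 3.0 dB this gives about 33.2 Hz, in line with the corresponding table entry. Because the center frequency cancels in the subtraction, the sketch also reflects the caption's remark that the predicted differences are independent of center frequency.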

B. Simultaneous amplitude and frequency modulation

Iwamiya et al. (1984) studied the principal pitch heard by listeners in signals
designed to approximate vibrato in musical sounds. Note that principal pitch is
another term for spectral pitch. Complex sounds were generated by modulating a
sinusoidal carrier both in frequency and amplitude with the same modulating signal.
The carrier (at frequencies of 440, 880, or 1500 Hz) was modulated by a
low-frequency (6 Hz) triangular waveform. Thus, if $D_{AM}$ is the “degree of AM”
and $E_{FM}$ the “extent of FM in cents,” the modulated signal for a carrier $f_c$
is given by

$$s(t) = \left[1 + D_{AM}\, m(t)\right]\cos\left[2\pi f_c t + 0.5\,E_{FM}\int_0^t m(\tau)\, d\tau\right], \tag{28}$$

where $m(t)$ is the modulating signal.

Listeners were asked to match the pitch they heard in these modulated tones to
that of a pure tone. The experiment was conducted for AM and FM modulations
presented “in-phase” and “out-of-phase.” That is, the modulations were “in-phase”
when a frequency increase was accompanied by an amplitude increase. Out-of-phase,
then, meant that a frequency increase accompanied an amplitude decrease. Also, they
conducted some pitch matches with the degree of AM set to 1.0 with the extent of FM
taking values of 0, 25, 50, and 100 cents. Other trials held the extent of FM at
100 cents while the degree of AM was 0.0, 0.50, 0.75, or 1.00.

Results taken from Iwamiya et al. are plotted in Fig. 1. Also shown in Fig. 1
are spectral pitch values predicted by the EWAIF and IWAIF models. Note that the
listeners exhibit a small negative bias. Matches to simple AM tones, which should
be at the carrier frequency (i.e., zero difference from $f_c$), fall a few cents
below. The EWAIF and IWAIF models cannot account for this bias; however, they both
produce predicted spectral pitch match results for “in-phase” and “out-of-phase”
conditions that agree with listener performance as extent of modulation is
increased.

In general, the models predict somewhat larger pitch differences between signal
pairs (in-phase versus out-of-phase) as compared to the listeners.
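The direction of the predicted pitch shift for in-phase and out-of-phase modulation can be checked with a small simulation. The sketch below is illustrative only and rests on several assumptions that are not taken from Iwamiya et al.: the synthesis parameters (16-kHz sampling rate, 1-s duration), the conversion of the FM extent from cents to a peak deviation in Hz, the shape and starting phase of the triangular modulator, and all function names. NumPy is assumed to be available.

```python
import numpy as np


def triangle_wave(t, rate_hz):
    """Zero-mean triangular wave in [-1, 1] with the given repetition rate."""
    cycle = (t * rate_hz) % 1.0
    return 1.0 - 4.0 * np.abs(cycle - 0.5)


def fm_am_tone(fc, extent_cents, degree_am, in_phase=True,
               fs=16000, dur=1.0, rate_hz=6.0):
    """Carrier modulated in amplitude and frequency by the same triangular wave."""
    t = np.arange(int(fs * dur)) / fs
    m = triangle_wave(t, rate_hz)
    # Assumption: the peak deviation is half the peak-to-peak extent in cents.
    dev_hz = fc * (2.0 ** (0.5 * extent_cents / 1200.0) - 1.0)
    sign = 1.0 if in_phase else -1.0
    inst_freq = fc + sign * dev_hz * m
    phase = 2.0 * np.pi * np.cumsum(inst_freq) / fs   # running integral of frequency
    return (1.0 + degree_am * m) * np.cos(phase)


def iwaif_fft(s, fs):
    """Center of gravity of the positive-frequency energy spectrum, in Hz."""
    n = len(s)
    energy = np.abs(np.fft.rfft(s)[: n // 2]) ** 2
    freqs = np.arange(len(energy)) * fs / n
    return np.sum(freqs * energy) / np.sum(energy)


if __name__ == "__main__":
    fs = 16000
    for in_phase in (True, False):
        tone = fm_am_tone(440.0, 100.0, 1.0, in_phase=in_phase, fs=fs)
        print("in-phase" if in_phase else "out-of-phase", iwaif_fft(tone, fs))
```

With these settings the in-phase tone yields an IWAIF above the 440-Hz carrier and the out-of-phase tone an IWAIF below it, which is the sign pattern described for the model predictions in Fig. 1.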


C. Application to signals used in profile analysis studies

Feth and Stover (1987) attempted to extend the EWAIF model to the complex
signals used in many of the early profile analysis experiments. In those studies
the listener was asked to determine which of two signals, consisting of a number of
sinusoids, contained a small increment to the amplitude of the sinusoid in the
center. To deter the use of absolute intensity discrimination in determining which
complex signal contains the increment, the overall level of each presentation is
selected at random from a range of level values. This is commonly called a “roving
level” paradigm. The assumption is that since the listeners will be unable to use
simple (absolute) intensity cues, they will be forced to base discrimination
decisions on the difference in overall spectral shape, or profile, of the complex
signals. This assumption ignores the possibility that listeners may respond to
other cues available in the incremented signal. For example, interactions among the
inharmonically spaced sinusoids that make up the complex signal can lead to
frequency modulations (FM). Further, adding a small increment to the amplitude of
one sinusoid in the complex can lead to a change in the FM produced by the
interaction. Such changes in FM may be audible as subtle pitch shifts. The size of
the frequency shift produced by an increment to one component depends only on the
relative amplitudes of the components, not the absolute levels. Thus, the FM
produced would be unaffected by “roving level” procedures.

TABLE II. EWAIF and IWAIF based predictions of pitch differences between
multicomponent equal-amplitude profile signals and their incremented counterparts
(with threshold increments). Note that while the threshold increment gets smaller
as more components are added, there is little change in the ΔEWAIF and ΔIWAIF
values, indicating equal listener performance.

                                                 Model predictions
Number of components   Threshold increment   ΔEWAIF (Hz)   ΔIWAIF (Hz)
in profile complex     (dB)
3                      -0.1                  21.1          17.3
5                      -4.2                  29.5          25.8
7                      -11.2                 21.8          18.0
9                      -13.0                 23.7          20.0
11                     -13.9                 33.6          29.8

FIG. 1. Localized principal pitch, EWAIF and IWAIF of FM-AM tones as a function of
the extent of FM with the frequency and amplitude modulations both in-phase (lines
with positive slope) and anti-phase (lines with negative slope) (Iwamiya et al.,
1984). The solid lines are a regression line fit to the collected data represented
by solid circles. The open triangles and open squares represent the IWAIF and the
EWAIF predictions, respectively. (a) 440-Hz carrier frequency, (b) 880-Hz carrier
frequency.

Early profile investigators were puzzled by an anomalous result that occurred
when discrimination performance was observed as the number of components in the
profile signal was increased (Green et al., 1983; Green et al., 1984). As the
number of components was increased from 3 to 11 sinusoids, the just-detectable
increment was found to be progressively smaller. That is, it appeared that
listeners were more sensitive to an increment in 1 of 11 components than they were
to an increment in 1 of 3 or 5 components. The puzzle was to explain how adding
additional components to the complex signal could lead to such enhanced
performance. However, if we consider the FM produced by an increment to the center
component in the complex signal, we may find an alternative explanation for
listener behavior.

The increment in the amplitude of the signal component leads to a small
difference in the EWAIF values calculated for the standard and the target complex
sounds. These EWAIF differences reflect the difference in FM between these sounds.
Feth and Stover (1987) showed that the EWAIF difference between the just-detectable
target and the standard was approximately the same, independent of the number of
components in the profile signals. Thus, while a profile analysis approach is
unable to explain the listeners’ improvement in performance with increasing numbers
of components, the EWAIF provides a successful explanation. Progressively smaller
increments in the amplitude of the central sinusoid of complex signals made up of
larger numbers of components lead to approximately the same difference in EWAIF.
Feth and Stover (1987) argued that the detection of frequency modulation was a more
parsimonious explanation of the phenomenon. Table II gives the comparison of
just-detectable amplitude increments across component number with the corresponding
EWAIF and IWAIF values that each increment produces. Note that EWAIF and IWAIF
predictions are quite comparable.

IV.  CONCLUSIONS 

The intensity-weighted average of instantaneous frequency
(IWAIF) model has been presented as an alternative to the
envelope-weighted average of instantaneous frequency
(EWAIF) model. Calculation of the EWAIF of a signal
involves determining the envelope and the instantaneous
frequency functions of the signal separately. This can be
computationally cumbersome, especially as the bandwidth of
the signal gets wider. The IWAIF of a signal, on the other
hand, can be expressed solely in terms of the magnitude
spectrum of the signal. Such a frequency domain
representation provides a fast and efficient method to
compute the IWAIF of a signal using the FFT algorithm.
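
The frequency-domain form of the IWAIF can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sampling rate, and the tone parameters are assumptions chosen so that both components fall exactly on FFT bins, and a small recursive FFT stands in for whatever FFT routine was actually used.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def iwaif(samples, fs):
    """Centre of gravity (Hz) of the positive-frequency power spectrum."""
    n = len(samples)
    spectrum = fft(samples)
    num = 0.0
    den = 0.0
    for k in range(1, n // 2):  # positive-frequency bins; DC and Nyquist skipped
        power = abs(spectrum[k]) ** 2
        num += (k * fs / n) * power
        den += power
    return num / den


def two_tone(f1, a1, f2, a2, fs, n):
    """Sampled two-component complex a1*cos(2*pi*f1*t) + a2*cos(2*pi*f2*t)."""
    return [a1 * math.cos(2 * math.pi * f1 * i / fs)
            + a2 * math.cos(2 * math.pi * f2 * i / fs) for i in range(n)]


if __name__ == "__main__":
    fs, n = 8192, 1024  # hypothetical values giving 8-Hz bin spacing
    tone = two_tone(1000.0, 1.0, 1040.0, 0.5, fs, n)
    print(round(iwaif(tone, fs), 3))
```

For a two-component complex the centre of gravity reduces to the power-weighted mean of the two frequencies, here (1 x 1000 + 0.25 x 1040) / 1.25 = 1008 Hz. Interchanging the two amplitudes moves it to 1032 Hz, which is the kind of shift that distinguishes a complementary two-tone pair.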

The IWAIF model was tested on three sets of stimuli,
viz. Voelcker's complementary two-tone complexes used in
experiments by Feth and co-workers (Feth, 1974; Feth
et al., 1982; Feth and O'Malley, 1977), FM-AM tones
used by Iwamiya et al. (1984), and profile signals used by
Green et al. (1984). The performance of the IWAIF
model was found to be comparable to that of the EWAIF
model.
APPENDIX A: EWAIF

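The frequency domain representation of the EWAIF
can be derived as follows. The EWAIF of s(t) can be
written as

$$
\mathrm{EWAIF} = \frac{\int_0^T e(t)\,f(t)\,dt}{\int_0^T e(t)\,dt}
= \frac{1}{2\pi}\,\frac{\int_0^T |m(t)|\,\mathrm{Im}[m'(t)/m(t)]\,dt}{\int_0^T |m(t)|\,dt}.
$$

Consider the numerator,

$$
\int_0^T e(t)\,f(t)\,dt
= \frac{1}{2\pi}\int_0^T |m(t)|\,\mathrm{Im}\!\left[\frac{m'(t)}{m(t)}\right]dt
= \frac{1}{2\pi}\,\mathrm{Im}\int_0^T \sqrt{m(t)\,m^*(t)}\;\frac{m'(t)}{m(t)}\,dt,
$$

where m*(t) denotes the complex conjugate of m(t),

$$
= \frac{1}{2\pi}\,\mathrm{Im}\int_0^T \frac{m'(t)}{\sqrt{m(t)}}\,\sqrt{m^*(t)}\,dt \tag{A4}
$$

$$
= \frac{1}{\pi}\,\mathrm{Im}\int_0^T \left[\sqrt{m(t)}\right]'\left[\sqrt{m(t)}\right]^{*}dt. \tag{A5}
$$

Applying the theorem for the Fourier transform of the
derivative of a signal and invoking Parseval's theorem, the
numerator can be expressed as

$$
\int_0^T e(t)\,f(t)\,dt = \frac{1}{\pi}\,\mathrm{Im}\int_{-\infty}^{\infty} j2\pi f\,M_s(f)\,M_s^*(f)\,df \tag{A6}
$$

$$
= 2\int_{-\infty}^{\infty} f\,|M_s(f)|^2\,df, \tag{A7}
$$

where $M_s(f) = \mathcal{F}[\sqrt{m(t)}]$ is the Fourier transform of $\sqrt{m(t)}$.
Similarly, the denominator can be expressed as

$$
\int_0^T |m(t)|\,dt = \int_0^T \sqrt{m(t)}\left[\sqrt{m(t)}\right]^{*}dt \tag{A8}
$$

$$
= \int_{-\infty}^{\infty} |M_s(f)|^2\,df. \tag{A9}
$$

Hence,

$$
\mathrm{EWAIF} = 2\,\frac{\int_{-\infty}^{\infty} f\,|M_s(f)|^2\,df}{\int_{-\infty}^{\infty} |M_s(f)|^2\,df}.
$$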
APPENDIX B: IWAIF


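In order to derive the frequency domain representation
of the IWAIF, consider

$$
\mathrm{IWAIF} = \frac{\int_0^T e^2(t)\,f(t)\,dt}{\int_0^T e^2(t)\,dt}
= \frac{1}{2\pi}\,\frac{\int_0^T |m(t)|^2\,\mathrm{Im}[m'(t)/m(t)]\,dt}{\int_0^T |m(t)|^2\,dt}.
$$

In the frequency domain the numerator can be expressed
as

$$
\int_0^T e^2(t)\,f(t)\,dt = \frac{1}{2\pi}\,\mathrm{Im}\int_0^T |m(t)|^2\,\frac{m'(t)}{m(t)}\,dt \tag{B2}
$$

$$
= \frac{1}{2\pi}\,\mathrm{Im}\int_0^T m'(t)\,m^*(t)\,dt \tag{B3}
$$

$$
= \frac{1}{2\pi}\,\mathrm{Im}\int_{-\infty}^{\infty} j2\pi f\,M(f)\,M^*(f)\,df \tag{B4}
$$

$$
= \int_{-\infty}^{\infty} f\,|M(f)|^2\,df, \tag{B5}
$$

where M(f) is the Fourier transform of m(t).
The expression for the Fourier transform of the differential
of a signal as well as Parseval's relation were made use of
in the foregoing simplification. Again, by Parseval's
relation, the denominator is

$$
\int_0^T e^2(t)\,dt = \int_0^T |m(t)|^2\,dt = \int_{-\infty}^{\infty} |M(f)|^2\,df.
$$

Hence,

$$
\mathrm{IWAIF}[s(t)] = \frac{\int_{-\infty}^{\infty} f\,|M(f)|^2\,df}{\int_{-\infty}^{\infty} |M(f)|^2\,df}.
$$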
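
The equivalence derived in Appendix B can be checked numerically for a simple case. The sketch below is an illustration under stated assumptions, not part of the original derivation: it builds a two-component analytic signal directly (so no Hilbert transform is needed), evaluates the intensity-weighted average of instantaneous frequency in the time domain over one beat period, and compares it with the centre of gravity of the line spectrum. The function names and the component values are hypothetical.

```python
import cmath
import math


def iwaif_time_domain(amps, freqs, duration, steps=4000):
    """Intensity-weighted average of instantaneous frequency (midpoint rule)."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        t = (i + 0.5) * duration / steps
        m = 0j   # analytic signal m(t)
        dm = 0j  # its time derivative m'(t)
        for a, f in zip(amps, freqs):
            phasor = a * cmath.exp(2j * math.pi * f * t)
            m += phasor
            dm += 2j * math.pi * f * phasor
        intensity = abs(m) ** 2                      # e^2(t) = |m(t)|^2
        inst_freq = (dm / m).imag / (2 * math.pi)    # f(t) = Im[m'/m] / (2*pi)
        num += intensity * inst_freq
        den += intensity
    return num / den


def iwaif_spectral(amps, freqs):
    """Centre of gravity of a line spectrum with the given amplitudes."""
    return sum(a * a * f for a, f in zip(amps, freqs)) / sum(a * a for a in amps)


amps, freqs = [1.0, 0.5], [1000.0, 1040.0]   # hypothetical two-tone complex
beat_period = 1.0 / (freqs[1] - freqs[0])     # average over one full beat
time_value = iwaif_time_domain(amps, freqs, beat_period)
freq_value = iwaif_spectral(amps, freqs)
```

Both routes should agree to within rounding error, since the cross terms in the time-domain integrand average to zero over a full beat period.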


¹The IWAIF has also been referred to as the squared-envelope-weighted
average of instantaneous frequency (SEWAIF).

‘Degree  of  amplitude  modulation  (AM)  is  given  by 

‘Extent  of  frequency  modulation  (FM)  is  given  in  cents  as 

1200  log, 


Anantharaman, J. N. (1992). "A multichannel signal processing model
for complex sound discrimination," Master's thesis, The Ohio State
University, Columbus, OH.

Anantharaman, J. N., Krishnamurthy, A. K., and Feth, L. L. (1991).
"Auditory processing of complex signals using the multichannel
EWAIF," J. Acoust. Soc. Am. 89, 1938-1939 (A).

Feth, L. L. (1974). "Frequency discrimination of complex periodic
tones," Percept. Psychophys. 15, 375-378.

Feth, L. L., and O'Malley, H. (1977). "Two-tone auditory spectral
resolution," J. Acoust. Soc. Am. 62, 940-947.

Feth, L. L., O'Malley, H., and Ramsey, J., Jr. (1982). "Pitch of
unresolved, two-component complex tones," J. Acoust. Soc. Am. 72,
1403-1412.

Feth, L. L., and Stover, L. J. (1987). "Demodulation processes in
auditory perception," in Auditory Processing of Complex Sounds, edited by
W. A. Yost and C. S. Watson (Erlbaum, Hillsdale, NJ), pp. 76-86.

Green, D. M. (1988). Profile Analysis: Auditory Intensity Discrimination
(Oxford U.P., New York).




728  J. Acoust. Soc. Am., Vol. 94, No. 2, Pt. 1, August 1993


Anantharaman et al.: IWAIF  728



Green, D. M., Kidd, G., Jr., and Picardi, M. C. (1983). "Successive
versus simultaneous comparison in auditory intensity discrimination,"
J. Acoust. Soc. Am. 73, 639-643.

Green, D. M., Mason, C. R., and Kidd, G., Jr. (1984). "Profile analysis:
Critical bands and duration," J. Acoust. Soc. Am. 75, 1163-1167.

Iwamiya, S., Nishikawa, S., and Kitamura, O. (1984). "Perceived
principal pitch of FM-AM tones when the phase difference between
frequency modulation and amplitude modulation is in-phase and
anti-phase," J. Acoust. Soc. Jpn. 5, 59-69.

Kidd, G., Jr., Mason, C. R., Uchanski, R. M., Brantley, M. A., and Shah,
P. (1991). "Evaluation of simple models of auditory profile analysis
using random reference spectra," J. Acoust. Soc. Am. 90, 1340-1354.

McGillem, C. D. (1979). Hilbert Transforms and Analytic Signals
(unpublished notes on signal processing, Purdue University).

Voelcker, H. B. (1966a). "Toward a unified theory of modulation - Part
I: Phase-envelope relationships," Proc. IEEE 54, 340-353.

Voelcker, H. B. (1966b). "Toward a unified theory of modulation - Part
II: Zero manipulation," Proc. IEEE 54, 735-755.

von Helmholtz, H. L. F. (1954). On the Sensations of Tone as a
Physiological Basis for the Theory of Music (Dover, New York), 2nd English
edition, p. 163 and Appendix XIV.



Detection  of  Combinations  of  Frequency  Modulation: 

An Application of the IWAIF Model

Lawrence L. Feth, Ashok K. Krishnamurthy* and Tao Zhang

Speech and Hearing Science and *Electrical Engineering, The Ohio State University
Columbus,  Ohio  43210  USA 


1  Introduction 

Our work has been guided by the underlying assumption that human auditory
communication is a modulation-demodulation process. That is, we assume that sources produce
a complex stream of sound pressure waves with information encoded as variations (i.e.,
modulation) of signal amplitude and frequency. Speech, music and most environmentally
important sounds can be described in this way. Recently, Maragos et al. (1992) have shown that
an energy-tracking operator can be applied to speech signals to produce an algorithm that tracks
speech formant amplitude and frequency. The result of the energy-tracking operation is an
amplitude-by-frequency product that is quite similar to the EWAIF/IWAIF calculations used in
our previous work (Anantharaman et al., 1993). Earlier work by Teager and Teager (1990)
showed that the production of speech could be modeled as the modulation of amplitude and
frequency of each formant. Fineberg et al. (1992) have also begun to apply the modulation
model to speech recognition problems.

The human listener's task is then modeled as one of demodulating the sound stream. Much
of the past work in psychoacoustics might be characterized as “spectrum picture processing.”
That is, complex sounds are Fourier-analyzed into an amplitude-by-frequency picture. The
experimenter then models auditory perception as a process of analyzing this “spectrum picture.”
The work on “profile analysis” (see Green, 1988) could be described in this manner. The
spectrum picture processing approach leads to studies of broad bandwidth, complex sounds in
masking or discrimination experiments. Our “mo-dem” approach leads us to investigate
time-varying, complex sounds. We suggest that understanding the auditory processing of such signals
with dynamic spectra is essential to better understanding human auditory perception.

1.1  Previous  work  with  STEP-GLIDE  signals 

The Envelope-Weighted Average of Instantaneous Frequency (EWAIF) model was
originally developed to predict the spectral pitch of narrow bandwidth signals for normal hearing
human listeners (Feth, 1974; Feth et al., 1982). We moved to the study of frequency transitions
(Feth et al., 1989) because formant transitions are essential in the description of consonants in
speech (e.g., Liberman et al., 1956). Our experimental work was first designed to determine the
appropriate width of the ear's temporal window for the processing of such signals. Several
determinations of the limits of auditory temporal acuity are given in the literature (e.g., Green,
1991, 1973a, b, 1985). Moore and his colleagues have reported the equivalent rectangular
duration (ERD) for a temporal window that is the time-domain analog of the roex filter shape of
the peripheral filter bank (Moore et al., 1988; Plack and Moore, 1990, 1991). Little of this
previous work has used dynamic signals, that is, signals with frequency transitions. To determine
a measure of temporal acuity for such signals, we devised a discrimination task in which listeners
were asked to distinguish between a tone frequency modulated over a linear trajectory (a GLIDE)
and one covering the same frequency change via a multiple-step trajectory (the STEP). Details
are given in Madden and Feth (1992), but the essential result was that the just-discriminable step
was approximately 7 to 10 ms for frequencies of 2 kHz and below. Above 2 kHz, the
just-discriminable step becomes longer.

Application of the original EWAIF model to dynamic signals proved to be difficult because
of the complexity of the EWAIF calculation. The result was the IWAIF model (Anantharaman et
al., 1993). Here, intensity is used to weight the frequency values. The advantage is that the
IWAIF can be calculated in the frequency domain using an FFT, whereas the EWAIF was
calculated in the time domain. In addition to this great improvement in computational efficiency,
the IWAIF form of the model has led to an interpretation of the model output that appears to
have wide applicability in our further understanding of auditory signal processing. The result of
the IWAIF calculation is the “center of gravity” of the signal spectrum. Such a simple concept
has great intuitive appeal for predicting the locus of spectral pitch for many sounds. It is also the
basis for the “perceptual formant” suggested by Chistovich (1979) as the determinant of vowel
quality.

A short-term model (ST-IWAIF) can be applied to signals with frequency transitions
(Krishnamurthy and Feth, 1993). For example, the STEP and GLIDE signals have nearly the
same long-term spectrum, and consequently, the same long-term IWAIF values. Nonetheless,
they are easily discriminable, indicating that human listeners are able to utilize cues that are more
short-term in nature. To explain such data, it is necessary to introduce the ST-IWAIF model.
The model is based on the assumption that the listener can track the changing IWAIF of a
dynamic signal and use it as a potential cue for discriminating between two signals. The
ST-IWAIF of a signal at time t0 is determined by the spectral properties of the signal in a short time
window around t0. Let s1(t) and s2(t) be the two signals, both of duration T, that
are to be discriminated. The listener is assumed to track the ST-IWAIF values I1(t) and I2(t),
0 ≤ t ≤ T, of s1(t) and s2(t), respectively. Further, we assume that there is internal noise in the
auditory system that limits our ability to track frequency. This internal noise is modeled as an
additive noise w(t) that corrupts the true ST-IWAIF values. We assume that w(t) is white,
zero-mean, Gaussian noise. Since the IWAIF is essentially a signal frequency parameter, we
suggest that pure tone frequency-difference limen (DL) data be used to set its power spectral
density. Thus, if the frequency DL for a tone of duration T at the IWAIF frequency is
Δ, we suggest that


(1) 


Given the above assumptions, the d′ for this model is given by


d′ =


(/,(0 -/,(/))  V/ 


(2) 
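
The short-term tracking idea described above can be sketched as a sliding-window centre-of-gravity calculation. This is only an illustration under stated assumptions: the Hann window, its 128-sample length (about 7.8 ms at the hypothetical 16384-Hz sampling rate), the hop size, and the particular GLIDE and STEP trajectories are all choices made for the example and are not taken from the original model. The internal-noise and d′ stages are omitted.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def st_iwaif(samples, fs, win_len=128, hop=64):
    """Spectral centre of gravity (Hz) in successive Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / win_len)
              for i in range(win_len)]
    track = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(win_len)]
        spectrum = fft(frame)
        num = 0.0
        den = 0.0
        for k in range(1, win_len // 2):  # positive-frequency bins only
            power = abs(spectrum[k]) ** 2
            num += (k * fs / win_len) * power
            den += power
        track.append(num / den)
    return track


def fm_tone(freq_of_t, fs, n):
    """Tone whose instantaneous frequency follows freq_of_t(t) (phase accumulation)."""
    phase = 0.0
    out = []
    for i in range(n):
        out.append(math.cos(phase))
        phase += 2 * math.pi * freq_of_t(i / fs) / fs
    return out


FS = 16384            # hypothetical sampling rate (Hz)
N = int(0.1 * FS)     # roughly 100 ms of signal
glide = fm_tone(lambda t: 1000.0 + 4000.0 * t, FS, N)             # 4 Hz/ms linear glide
step = fm_tone(lambda t: 1050.0 + 100.0 * int(t / 0.025), FS, N)  # four 25-ms steps
glide_track = st_iwaif(glide, FS)
step_track = st_iwaif(step, FS)
```

Although the two signals cover the same overall frequency change, the GLIDE track rises smoothly while the STEP track moves in plateaus, so the frame-by-frame difference between the tracks is the kind of short-term cue the model assumes the listener can use.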


The ST-IWAIF model was applied to the results of the original STEP-GLIDE
discrimination task with good results. Figure 1 shows the duration of an individual step for STEP
signals that were discriminable from GLIDE signals on 75% of the trials. Open symbols represent
results averaged for four listeners over signal durations ranging from 25 to 100 ms. Performance is
collapsed over signal duration and overall frequency excursion, to be characterized by the rate of
transition. Transition rates were 2, 4 and 8 Hz/ms. Filled symbols represent the performance
predicted by the ST-IWAIF model. Model parameters were adjusted so that predictions and data
were matched at 1 kHz for the 8 Hz/ms conditions. Those same parameters were then used to
predict performance for all other conditions. In general, the model predicts average listener
performance very well. At 250 Hz it predicts better performance (i.e., shorter step size) than our
listeners obtained. At 4 kHz, the model predicts poorer performance (larger step size) than our
listeners obtained.


Figure 1. Just-discriminable step size in ms. Open symbols are averaged for four listeners with
normal hearing. Circles are for a sweep rate of 8 Hz/ms, squares are 4 Hz/ms and triangles are
2 Hz/ms. Filled symbols are the predicted values obtained from the ST-IWAIF model.


1.2  Madden's  model  for  FM  glide  discrimination 

Recently, Madden (1994) extended the investigation of temporal processing of FM glides.
Using an adaptive 2AFC task, he determined the smallest frequency increase between successive
steps at which the STEP signal could just be distinguished from the GLIDE. Madden modeled his
results using an intensity-based model: a bank of bandpass filters, each followed by a
nonlinearity, temporal window and level detector. Similar models have been used by several
investigators of auditory temporal acuity (e.g., Viemeister, 1979; Forrest and Green, 1987;
Shailer and Moore, 1987; Green and Forrest, 1988). Madden's modeling indicated that the
equivalent rectangular duration (ERD) of the temporal window was about 5 ms for signals
ranging from 250 Hz to 6 kHz. However, he was forced to allow detector efficiency to vary
substantially over the frequency range to obtain a fit to his data. This is in marked contrast to the
detector efficiency reported for temporal acuity of signals without FM.
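
As a small illustration of what an equivalent rectangular duration measures, the sketch below computes the ERD of a sampled weighting function as its area divided by its peak value. The two-sided exponential window, its 2.5-ms time constant, and the sampling step are hypothetical choices for the example; they are not the window shape fitted by Madden.

```python
import math


def equivalent_rectangular_duration(weights, dt):
    """Duration of the rectangle with the same peak and area as the window."""
    return sum(weights) * dt / max(weights)


dt = 1e-5       # 10-microsecond sampling step (hypothetical)
tau = 2.5e-3    # time constant of a two-sided exponential window (hypothetical)
two_sided_exp = [math.exp(-abs(i * dt) / tau) for i in range(-5000, 5001)]
erd_ms = 1e3 * equivalent_rectangular_duration(two_sided_exp, dt)
```

For a two-sided exponential the area is twice the time constant, so a 2.5-ms time constant corresponds to an ERD of about 5 ms.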

1.3  IWAIF  model  predictions  of  Madden’s  results 

The  ST-IWAIF  model  was  applied  to  Madden’s  results.  The  model  predictions  are  shown 
in  Figure  2  along  with  the  averaged  data  from  Madden’s  paper.  The  ST-IWAIF  model  predicts 
slightly  better  performance  than  Madden’s  listeners  obtained.  Given  that  the  listeners  may  not  be 
100%  efficient,  the  prediction  of  slightly  better  performance  is  not  unexpected.  Behavior  of  the 
model  predictions  is  in  line  with  expectations,  except  for  signals  with  a  large  number  of  steps.  For 
the  difficult  discrimination  of  nine,  ten  or  eleven  steps  versus  the  seventeen  steps  in  the  standard 
signal,  Madden’s  adaptive  procedure  apparently  drove  the  listeners  to  extremely  large  frequency 
increments  between  steps.  We  assume  that  they  were  using  a  different  cue  to  reach  criterion 
when  the  increment  per  step  was  over  100  Hz. 

2  Detection  of  sinusoidal  plus  ramp  FM 

There are some concerns about the influence of subtle spectral differences between the glide
and the step signals with the use of the STEP-GLIDE discrimination task. To minimize the
possible contamination of “splatter” at each step, Madden used a 17-STEP signal as the standard
rather than a true linear glide. Further, the transitions were “rounded” to reduce the spread of
energy when the signal frequency was abruptly changed to a new value.

Consider the STEP signal used in the previous work. It can be described as a triangular
wave modulator added to a linear ramp before the combined waveform is used to modulate the
frequency of a carrier tone. It can be difficult to specify the modulation index of such a combined
modulation waveform. If, instead, a sinusoid is added to a linear ramp to produce the combined
modulator, the resulting modulator is easy to specify. This new target signal replaces the STEP
signal used previously. If the slope of the ramp is zero (i.e., no change in base frequency), the
listener's task is simply detection of sinusoidal FM. When the sinusoidal FM is added to a linear
ramp, the listener's task is similar to that in the discrimination of STEP versus GLIDE signals
(Zhang et al., 1994).
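
A sinusoid-plus-ramp frequency modulator of this kind can be written down in closed form. The sketch below is a minimal illustration, not the signal-generation code used in the experiments: the function names, the sampling rate, and the example parameter values are assumptions, and the onset and offset gating is omitted. The modulation index beta is taken in the usual sense, peak frequency deviation divided by modulation rate.

```python
import math


def ramp_plus_sine_frequency(t, f0, slope, beta, fm, phase0=0.0):
    """Instantaneous frequency (Hz): linear ramp plus sinusoidal FM of index beta."""
    return f0 + slope * t + beta * fm * math.sin(2 * math.pi * fm * t + phase0)


def ramp_plus_sine_tone(f0, slope, beta, fm, duration, fs, phase0=0.0):
    """Samples of sin(phi(t)), with phi the closed-form integral of the frequency."""
    samples = []
    for i in range(int(round(duration * fs))):
        t = i / fs
        phi = (2 * math.pi * (f0 * t + 0.5 * slope * t * t)
               - beta * (math.cos(2 * math.pi * fm * t + phase0) - math.cos(phase0)))
        samples.append(math.sin(phi))
    return samples


FS = 20000                  # hypothetical sampling rate (Hz)
SLOPE = 800.0 / 0.25        # a ramp rising 800 Hz over 250 ms, in Hz/s
standard = ramp_plus_sine_tone(1000.0, SLOPE, 0.0, 16.0, 0.25, FS)  # ramp only
target = ramp_plus_sine_tone(1000.0, SLOPE, 2.0, 16.0, 0.25, FS)    # ramp plus sine FM
```

Setting the slope to zero reduces the target to an ordinary sinusoidally frequency-modulated tone, and setting beta to zero gives the ramp-only standard.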


Figure 2. Comparison of Madden's results at 1 kHz with ST-IWAIF model predictions, plotted
against number of steps.


2.1  Method 

Listeners  with  normal  hearing  were  asked  to  determine  which  of  two  tones  was  sinusoidally 
frequency  modulated.  In  one  set  of  experimental  conditions,  the  standard  signal  was  a  steady 
tone,  and  the  target  signal  was  generated  by  frequency  modulating  the  standard  with  a  sinusoid  at 
4,  8,  16,  32,  64,  128  or  256  Hz.  In  other  conditions,  the  standard  signal  was  frequency 
modulated  by  a  linear  ramp.  The  target  was  then  generated  by  adding  the  sinusoidal  FM  to  the 
linear  FM.  To  avoid  the  possibility  of  anchoring  effects,  the  frequency  of  each  signal  was  chosen 
from  a  uniform  random  distribution.  This  is  commonly  called  a  roving-frequency  condition. 


2.2  Signal  generation 

All signals were generated using the TDT System II. Three listeners were tested at one
time. Separate channels from the four-channel D-to-A converter delivered signals to one side of a
Sennheiser HD 414 headset. Individual detection thresholds for the standard signals were
determined using an adaptive 2AFC procedure; the FM detection task was conducted at 50 dB
SL. Signal duration was 250 ms with rise-fall times of 5 ms.


2.3  Procedures 


Data were collected in blocks of 50 trials using an adaptive 2I, 2AFC procedure. A 3-up,
1-down rule was used (Levitt, 1971) to estimate the 79.4% point on the listeners' psychometric
functions. Data were collected from at least six blocks of trials before averaging the results.
When the results appeared to be too variable, an additional three blocks were run and the “best”
six were averaged.
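
The 79.4% figure follows from the structure of the transformed rule: the track moves in the harder direction only after three consecutive correct responses, so it settles where that event occurs half the time, that is, where p cubed equals 0.5. The sketch below illustrates this arithmetic and one trial of such a track. It is a hedged illustration, not the procedure software: the function names are hypothetical, and the convention that a larger level means an easier trial is an assumption.

```python
def convergence_point(n_in_a_row):
    """Proportion correct p at which n consecutive correct responses occur half the time."""
    return 0.5 ** (1.0 / n_in_a_row)


def update_track(level, run, correct, step, n_in_a_row=3):
    """One trial of the transformed rule; returns (new level, new run of correct responses)."""
    if not correct:
        return level + step, 0      # any error makes the next trial easier
    if run + 1 == n_in_a_row:
        return level - step, 0      # n correct in a row makes the next trial harder
    return level, run + 1           # otherwise keep the level and extend the run


target = convergence_point(3)
```

With three responses in a row the target evaluates to about 0.794; the same function gives about 0.707 for a rule requiring two in a row.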


3  Results 

3.1  Sinusoidal  FM  detection:  with  or  without  glide 

Figure 3 shows the results for detection of sinusoidal FM at 1 kHz averaged for three
listeners. The abscissa displays modulation frequency, ranging from 4 to 256 Hz. The ordinate is
β, the index of modulation required to obtain 79.4% correct detection. Results for detection of
sinusoidal FM added to a linear ramp FM are also shown. Here the ramp rises 800 Hz over 250
ms. In general, listeners have more difficulty detecting the sinusoidal FM in the presence of the
ramp than they do when the standard is an unmodulated tone. The data displayed in Figure 3
were averaged over the four roving-frequency ranges tested. Remarkably, there is no effect of
roving on the listeners' ability to detect the sinusoidal FM, either for the steady baseline or for
the ramped one.


Figure 3. Detection of sinusoidal FM averaged for three listeners. Circles are for
simple FM detection; filled squares are for detection of the sine FM added to a linear glide.
Starting frequency rove range: none, 200, 400 and 800 Hz.


3.2  Sine  wave  analog  to  STEP-GLIDE  discrimination 


Three listeners with normal hearing were tested in an FM detection task constrained to be
analogous to the earlier STEP-GLIDE discrimination task. The STEP signal was replaced by a
sinusoidal FM plus GLIDE modulator. Signal duration was set to 100 ms and the transition rate
was 4 Hz/ms. The sinusoidal FM was added to the linear ramp with starting phase at 180° to
better approximate the STEP signal. Detection thresholds for the sine FM were obtained at
octave frequencies from 250 Hz through 4 kHz, plus 6 kHz. An adaptive, 3-up, 1-down rule
(Levitt, 1971) was used to adjust the period of the sinusoidal FM to approximate the discrete
steps used earlier. Thus, for a 100 ms signal, a modulation rate of 10 Hz completes one cycle in
100 ms. To approximate 2 steps, the rate was changed to 20 Hz. For each new modulation rate,
the amplitude of the sinusoid was adjusted so that it would have the same power as the triangular
modulator that produced the original step function. This is only one of several constraints that
might be placed on the sinusoid to “match” it to the triangular waveform.
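
The two matching constraints described above reduce to simple arithmetic, sketched here under stated assumptions. The modulation rate is the number of cycles divided by the signal duration, and equal power follows from the mean-square values of the two waveforms: a zero-mean triangular (or sawtooth) wave of peak A has mean power A²/3, while a sinusoid of peak B has mean power B²/2. The function names are hypothetical and the code is not taken from the original experiment software.

```python
import math


def modulation_rate_hz(n_cycles, duration_s):
    """Sinusoidal FM rate that completes n_cycles within the signal duration."""
    return n_cycles / duration_s


def matched_sine_peak(triangle_peak):
    """Peak of a sinusoid with the same mean power as a zero-mean triangular wave."""
    # Equal power: B**2 / 2 == A**2 / 3, so B = A * sqrt(2 / 3).
    return triangle_peak * math.sqrt(2.0 / 3.0)


one_step_rate = modulation_rate_hz(1, 0.100)   # one cycle in 100 ms
two_step_rate = modulation_rate_hz(2, 0.100)   # two cycles in 100 ms
sine_peak = matched_sine_peak(1.0)
```

Under this constraint the sinusoid's peak is about 0.82 times the peak of the triangular modulator at every modulation rate.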

Discrimination results for three listeners are shown in Figure 4 along with the performance
predicted for each one using the ST-IWAIF model. As with the earlier STEP-GLIDE results, the
just-detectable period is approximately uniform from 250 Hz to 2 kHz. Performance is poorer at
4 kHz but at 6 kHz it appears to have leveled off.


Figure 4. Individual results for the listeners detecting a sinusoidal FM added to a linear glide,
plotted against center frequency (Hz). Filled symbols are ST-IWAIF predictions for listener
performance.


4  Discussion 


Our original set of STEP-GLIDE signals led us to conclude that the width of the temporal
processing window was approximately constant across signal frequencies below 2 kHz. We
attributed listeners' poorer performance in STEP-GLIDE discrimination at higher frequencies to
the ear's inability to follow the frequency transition, perhaps due to the loss of phase-locking in
the primary neural units. Madden's extension of that work (1994) introduced an adaptive testing
procedure and modeled the results using the “traditional” intensity-based model used in previous
temporal acuity measurements such as gap detection or temporal modulation transfer functions.
Madden's adaptive procedure kept the signal duration and number of steps fixed over a block of
experimental trials and varied the frequency excursion of the signal pairs. The adaptive procedure
determined the frequency interval (FI) for which listeners could distinguish the STEP from the
GLIDE. Recent work by Hsu (1993) has shown that the just-discriminable change in
frequency-transition slope obeys Weber's law. That is, ΔF/F in Madden's procedure changes with each
change required by the adaptive rule.

The intensity model used by Madden matched listener performance fairly well for the
mid-range of signal frequencies tested. The model had to be allowed poorer detector efficiency to
reproduce the upturn in just-discriminable frequency increment at the smallest and largest number
of steps. The large increase in frequency increment required to achieve discrimination
performance at 79.4% for the large number of steps is difficult to explain within the limits of the
assumed task presented to the listeners. In essence, they are asked to distinguish a transition
containing 9, 10 or 11 discrete steps over 50 ms from a standard containing 17 steps over that
same duration. The very large frequency increment required to “satisfy” the adaptive procedure
suggests that perhaps at this end of the discrimination task, listeners were using a different cue to
make the distinction. Certainly, the increments were large enough to suggest that simple
frequency discrimination of the initial frequency jump might be the cue.

At the other end of Madden's FI vs. step-number curve, there is a drop in FI with increasing
numbers of steps. For 2, 3 and 4 steps the overall frequency excursion is approximately 150 Hz.
In the earlier work, we found that the duration of a single step remained constant for criterion
performance over a wide range of signal durations and frequency excursions. The falling FI vs.
step-number curve may not reflect that same behavior.

We have shown above that an ST-IWAIF model can account for Madden's results about as
well as the intensity-based model he proposed. There remain some reservations about signal
artifacts (energy splatter at the transitions) and problems with the adaptive procedure as
implemented by Madden. Thus, we have proposed that sinusoidal FM added to a linear ramp be
used to further test the IWAIF model.

Our initial results indicate that the ST-IWAIF model can account for listener performance in
the detection of sinusoidal FM. When the ramp has zero slope, our task is the familiar FMDL
task. For that task, we have indicated that introducing a roving-frequency paradigm has little or
no effect on listener performance. Our results appear to be reasonable when compared with
previous determinations of the FMDL (Moore and Glasberg, 1989) given differences in
psychophysical procedures and signal parameters.

When the sinusoidal FM is added to a ramp modulation, listener performance in the
detection of FM is somewhat poorer at the lowest FM rates. In the “traditional” FMDL task, the
baseline for comparison is fixed in frequency over the duration of the signal. For our
ramp-plus-sine FM task, the baseline is moving.

Finally, when we use the sinusoidal FM plus GLIDE signal to replicate the earlier
STEP-GLIDE discrimination results, listener performance is consistent with our earlier findings. We
suggest that the listener's inability to “follow” the changing baseline at higher frequencies
(perhaps due to the loss of phase locking in the auditory nerve) probably accounts for the poorer
performance in both STEP-GLIDE discrimination and in FM detection. The ST-IWAIF model
predicts progressively poorer performance above 2 kHz than our listeners achieve. This may
indicate that such discriminations are not based on the listeners' ability to follow rapid frequency
transitions, as the model assumes. However, since the model uses the listener's DLF to estimate
variance in the tracking task, model performance is degraded as the DLF increases at higher signal
frequencies.
Abstract

The  Intensity  Weighted  Average  of  Instantaneous 
Frequency  (IWAIF)  model  has  been  successfully  used 
to  explain  a  number  of  psychoacoustic  results  in 
which  the  primary  cue  used  by  the  listener  is  fre¬ 
quency.  The  IWAIF  of  a  signal  is  the  frequency  of 
the  center-of-gravity  of  the  positive  frequency  half 
of  the  signal  spectrum.  With  a  few  exceptions,  the 
IWAIF  model  has  been  applied  only  to  narrowband 
signals.  In  this  paper,  we  propose  a  multi-channel 
extension  of  the  IWAIF  model  that  is  useful  in  ana¬ 
lyzing  wideband  signals.  The  output  of  the  multi¬ 
channel  IWAIF  model  is  a  vector  of  IWAIF  (fre¬ 
quency)  values.  We  then  present  two  applications 
of  the  IWAIF/multi-channel  IWAIF  model:  (1)  to 
explain the Chistovich “perceptual formant” effect
observed  in  vowel  perception;  and,  (2)  to  model  the 
detection  of  mixed  amplitude  and  frequency  modula¬ 
tion  (MM)  by  human  listeners.  Comparisons  of  the 
predictions  of  the  model  with  psychoacoustic  data 
show  that  the  model  predictions  are  in  reasonable 
agreement  with  the  data  at  high  modulation  rates 
(256  Hz),  while  at  lower  modulation  rates  (4  Hz,  16 
Hz),  the  model  predicts  a  phase  dependence  that  is 
not  present  in  the  data.  We  speculate  that  at  low 
modulation  frequencies,  a  short-term  IWAIF  model 
may  be  more  appropriate. 

Introduction 

Our  work  is  based  on  the  underlying  assumption  that 
human  auditory  communication  can  be  modeled  as 
a  modulation-demodulation  process.  In  this  model, 
the  information  in  the  sound  pressure  wave  is  en¬ 
coded  as  variations  in  the  amplitude  and  frequency 
of  the  signal.  Speech,  music  and  most  environmen¬ 
tally  important  sounds  can  be  described  in  this  way. 
The  human  listener’s  task,  then,  is  to  demodulate 
the  sound  stream  to  extract  the  encoded  information. 


This modulation-demodulation view of auditory processing
has also been recently (and independently)
advocated  by  Maragos  et  al.  (1992),  who  have  pro¬ 
posed  a  non-linear  energy  operator  for  extracting 
modulation  information  from  signals.  Fineberg  et 
al.  (1992)  have  also  applied  a  modulation  model  to 
speech  recognition. 

Our  work  in  the  past  several  years  has  concen¬ 
trated  primarily  on  the  processing  of  narrowband  fre¬ 
quency  modulated  signals.  Feth  (1974)  proposed  the 
Envelope  Weighted  Average  of  Instantaneous  Fre¬ 
quency  (EWAIF)  as  a  model  for  the  processing  of 
such  FM  as  well  as  AM  signals.  Recently,  we  have 
developed the Intensity Weighted Average of Instantaneous
Frequency (IWAIF) as an alternative to the
EWAIF (Anantharaman et al., 1993). The IWAIF
has a number of advantages over the EWAIF: (i) it
is easier to compute; and (ii) it has an intuitive
frequency domain interpretation, since the IWAIF of a
signal is simply the center-of-gravity frequency of the
signal spectrum (Anantharaman, 1992).

The primary applications of the EWAIF and
IWAIF models have been to narrowband signals that
are confined to a single critical band. Most real-world
signals, such as speech and music, are wideband.
The relevant information in such signals, such
as harmonicity or correlated amplitude or frequency
modulation, is often spread over several critical
bands. The psychoacoustic phenomena of Comodulation
Masking Release and Profile Analysis illustrate
that the human auditory system is capable of
following a multi-component signal in several channels
simultaneously (Green, 1988). It is essential to
extend the IWAIF model to a multi-channel version
to analyze such wideband signals. Any extension of
the IWAIF model to wideband signals will have to
incorporate our existing knowledge of the auditory
system, including such features as basilar membrane
filtering, compression, short- and long-term
adaptation, and phase locking. As a first step in this
direction, we present in this paper a multi-channel
IWAIF model that includes basilar membrane filtering
and spatial integration. This model is applied to
two psychoacoustic results: (i) the “perceptual formant”
effect in vowel perception described by Chistovich
(1979, 1985); and (ii) mixed modulation (MM)
perception.

The next section describes the basic IWAIF model;
then we introduce the multi-channel IWAIF model.
The applications of the multi-channel IWAIF model
to vowel perception and modulation detection form
the next two sections, and we conclude with some
recommendations for future work.

Basic  IWAIF  Model 

Let s(t) be a real signal, with instantaneous envelope
e(t) and instantaneous frequency f(t). The IWAIF
of the signal s(t) is defined as (Anantharaman et al.,
1992):

$$\mathrm{IWAIF}[s(t)] = \frac{\int e^{2}(t)\, f(t)\, dt}{\int e^{2}(t)\, dt} \tag{1}$$
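
As a numerical illustration of this definition, the sketch below evaluates the intensity-weighted average for a two-tone signal. It is a minimal sketch under stated assumptions, not the authors' implementation: the analytic signal is built directly from the two components (rather than with a Hilbert transform), and the function names, tone frequencies, amplitudes and sampling rate are all hypothetical.

```python
import cmath
import math


def iwaif_time_domain(z, fs):
    """Discrete estimate of the intensity-weighted average of instantaneous frequency.

    z  : samples of the analytic signal e(t) * exp(j * phase(t))
    fs : sampling rate in Hz
    """
    num = 0.0
    den = 0.0
    for k in range(len(z) - 1):
        # Mean instantaneous frequency over one sample interval, from the phase step.
        f_inst = cmath.phase(z[k + 1] * z[k].conjugate()) * fs / (2.0 * math.pi)
        # Squared envelope (intensity), averaged over the same interval.
        weight = 0.5 * (abs(z[k]) ** 2 + abs(z[k + 1]) ** 2)
        num += weight * f_inst
        den += weight
    return num / den


def two_tone_analytic(f1, a1, f2, a2, fs, n):
    """Analytic signal of a1*cos(2*pi*f1*t) + a2*cos(2*pi*f2*t), built directly (assumption)."""
    return [
        a1 * cmath.exp(2j * math.pi * f1 * k / fs)
        + a2 * cmath.exp(2j * math.pi * f2 * k / fs)
        for k in range(n)
    ]


fs = 64000.0                                   # hypothetical sampling rate
z = two_tone_analytic(900.0, 1.0, 1100.0, 2.0, fs, 3201)  # ten full beat periods
estimate = iwaif_time_domain(z, fs)
# Energy-weighted mean of the two component frequencies, for comparison.
reference = (1.0 ** 2 * 900.0 + 2.0 ** 2 * 1100.0) / (1.0 ** 2 + 2.0 ** 2)
print(estimate, reference)
```

The estimate should lie close to the energy-weighted mean of the two component frequencies, which is the same quantity a centre-of-gravity calculation on the spectrum would give.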


A much more convenient representation of the
IWAIF of s(t) is obtained if the above expression
is transformed to the frequency domain. As shown
by Anantharaman et al. (1993),

$$\mathrm{IWAIF}[s(t)] = \frac{\int_{0}^{\infty} f\, |S(f)|^{2}\, df}{\int_{0}^{\infty} |S(f)|^{2}\, df} \tag{2}$$


where S(f) is the Fourier transform of s(t). Thus,
the IWAIF of a real signal is the “center of gravity”
frequency of the positive-frequency portion of its energy
density spectrum. The IWAIF can be computed very
efficiently using the FFT (Anantharaman et al., 1993).
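
The centre-of-gravity computation can be sketched in a few lines. This is an illustrative sketch only: it uses a naive DFT from the standard library so that it is self-contained, whereas an FFT routine would give the same values much faster. The function names and the test signal are assumptions.

```python
import cmath
import math


def energy_spectrum(x):
    """Energy |S|^2 in DFT bins 0 .. N/2 (naive DFT; an FFT yields the same values)."""
    n = len(x)
    energy = []
    for b in range(n // 2 + 1):
        acc = sum(x[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        energy.append(abs(acc) ** 2)
    return energy


def iwaif_spectral(x, fs):
    """Centre-of-gravity frequency of the positive-frequency energy spectrum."""
    n = len(x)
    energy = energy_spectrum(x)
    total = sum(energy)
    return sum((b * fs / n) * e for b, e in enumerate(energy)) / total


fs = 8000.0   # hypothetical sampling rate
n = 400       # 20 Hz bin spacing, so both tones fall exactly on bins
x = [
    math.sin(2.0 * math.pi * 900.0 * k / fs)
    + 2.0 * math.sin(2.0 * math.pi * 1100.0 * k / fs)
    for k in range(n)
]
cog = iwaif_spectral(x, fs)
print(cog)
```

Because the 1100 Hz component carries four times the energy of the 900 Hz component, the centre of gravity is pulled toward 1100 Hz.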

Multi-channel  IWAIF  Model 


The multi-channel IWAIF model consists of three
stages (Mokheimer, 1993):

1. A filterbank stage, which models the bandpass filtering
of the incoming signal by the basilar membrane.
We use the Gammatone filterbank proposed by
Patterson et al. (1987) for this purpose.

2. A spatial integration stage, which combines the
outputs of a number of adjacent filters. As presently
configured, the outputs of three adjacent filters are
combined. The decision to combine only three adjacent
outputs was based on the observation that the
“perceptual formant” effect in vowel perception only
occurs if two formants are less than 3 critical bands
apart (Chistovich, 1979, 1985).

3. An IWAIF/Intensity computation stage, which
computes the IWAIF and the intensity at the output
of each channel.


Figure 1: The proposed multi-channel IWAIF model.

Figure 1 shows the proposed multi-channel IWAIF
model. When the outputs of adjacent channels are
combined, the blending weight for each channel is chosen to be
the relative intensity of that channel. Experiments
with various weighting choices showed that this choice led
to the best results for the modulation detection task.
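
One possible reading of this blending rule is sketched below. The text does not give the exact normalization, so this is an assumption: each weight is the channel's mean-square output divided by the summed mean-square output of the group, and edge channels simply use whichever neighbours exist. All names are hypothetical.

```python
def relative_intensity_weights(intensities):
    """Weights proportional to channel intensity, normalized to sum to 1."""
    total = sum(intensities)
    if total == 0.0:
        # Silent group: fall back to equal weights (assumption).
        return [1.0 / len(intensities)] * len(intensities)
    return [i / total for i in intensities]


def blend_adjacent(outputs, center):
    """Blend channel `center` with its immediate neighbours.

    outputs : list of channel output waveforms (lists of equal length)
    center  : index of the channel whose spatially integrated output is wanted
    """
    lo = max(center - 1, 0)
    hi = min(center + 1, len(outputs) - 1)
    group = outputs[lo:hi + 1]
    # Mean-square value of each channel output serves as its intensity.
    intensities = [sum(v * v for v in y) / len(y) for y in group]
    weights = relative_intensity_weights(intensities)
    n = len(group[0])
    return [sum(w * y[k] for w, y in zip(weights, group)) for k in range(n)]


# Three toy channel outputs (hypothetical values).
outputs = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0], [0.0, 0.0, 0.0, 0.0]]
blended = blend_adjacent(outputs, 1)
print(blended)
```

With this weighting, a strong channel dominates the blend and a silent neighbour contributes nothing.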

Given the signal s(t), 0 < t < T, the model leads
to a vector of (IWAIF, Intensity) pairs, one pair for
each channel; i.e., (f[n], L[n]), n = 1, ..., N_f, where
N_f is the number of frequency channels, f[n] is the
IWAIF value for the nth channel, and L[n] is the
intensity level for the nth channel.
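
The structure of this output vector can be illustrated with a simplified frequency-domain sketch. It is not the model as implemented by the authors: the input is a list of spectral lines, the filters are represented only by an approximate gammatone power response with an assumed ERB formula, and the spatial integration stage is omitted. The function names and centre frequencies are hypothetical.

```python
import math


def erb_hz(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore form; an assumption here)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_power_gain(f, fc, order=4):
    """Approximate power response of a gammatone filter centred at fc."""
    b = 1.019 * erb_hz(fc)
    return (1.0 + ((f - fc) / b) ** 2) ** (-order)


def channel_pairs(components, centers):
    """Return one (IWAIF in Hz, level in dB) pair per channel.

    components : list of (frequency in Hz, amplitude) spectral lines
    centers    : channel centre frequencies in Hz
    """
    pairs = []
    for fc in centers:
        # Energy of each spectral line after filtering by this channel.
        energies = [(f, (a ** 2) * gammatone_power_gain(f, fc)) for f, a in components]
        total = sum(e for _, e in energies)
        iwaif = sum(f * e for f, e in energies) / total
        pairs.append((iwaif, 10.0 * math.log10(total)))
    return pairs


components = [(900.0, 1.0), (1100.0, 1.0)]   # hypothetical two-component input
centers = [800.0, 1000.0, 1200.0]            # hypothetical channel centres
pairs = channel_pairs(components, centers)
for fc, (f_n, l_n) in zip(centers, pairs):
    print(fc, round(f_n, 1), round(l_n, 1))
```

In this toy case the channel centred between the two components reports their midpoint, while the outer channels report values pulled toward the nearer component.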

Fig.  2  shows  the  output  of  the  multi-channel 
IWAIF  model  to  a  sine  wave  at  1000  Hz.  Notice  that 
an  IWAIF  value  is  computed  even  for  channels  with 
center  frequencies  far  from  1000  Hz,  whose  output 
intensity  is  very  small.  This  is  because  the  IWAIF 
itself  is  independent  of  signal  energy,  and  the  IWAIF 
value  computed  in  these  channels  is  dominated  by 
round-off  noise  and  the  filter  impulse  response.  We 
make  the  assumption  that  the  auditory  system,  in 
most  situations,  ignores  channels  with  relatively  low 
energy.  Thus,  we  only  retain  for  further  processing 
those  channels  whose  relative  intensity  is  within  35 
dB  of  the  maximum  intensity.  Notice  that  for  these 
channels,  the  computed  IWAIF  is  very  close  to  1000 
Hz. 
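
The channel-retention rule just described can be sketched as a simple filter on the (IWAIF, level) pairs. The function name and the example values below are hypothetical; only the 35 dB range comes from the text.

```python
def retain_channels(pairs, range_db=35.0):
    """Keep channels whose level is within `range_db` of the maximum level.

    pairs : list of (IWAIF in Hz, level in dB) tuples, one per channel
    Returns a list of (channel index, IWAIF, level) for the retained channels.
    """
    peak = max(level for _, level in pairs)
    return [
        (i, f, level)
        for i, (f, level) in enumerate(pairs)
        if peak - level <= range_db
    ]


# Hypothetical model output for a 1000 Hz tone: remote channels have very low
# levels and unreliable IWAIF values.
pairs = [(431.0, -60.0), (998.0, -20.0), (1000.0, 0.0), (1003.0, -34.0), (2750.0, -80.0)]
kept = retain_channels(pairs)
print([i for i, _, _ in kept])
```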

Modeling  the  “Perceptual  Formant”  effect 
using  the  IWAIF 

In a series of papers on the perception of vowel quality,
Chistovich and her colleagues (1979, 1985) asked


Figure 2: Output of the multi-channel IWAIF model
to a tone at 1000 Hz. The intensity at the output
of a channel is plotted against the channel center
frequency in the top graph; the bottom graph shows
the IWAIF value against the center frequency. The
vertical dashed lines bracket the set of channels in
which the intensity is within 35 dB of the maximum.

listeners to match synthesized two-formant stimuli
to a single-formant stimulus whose center frequency
could be changed. The levels of the two formants in
the two-formant stimulus were varied. Chistovich
and her colleagues found that as long as the two
formants are less than 3.5 critical bands apart,
listeners matched the two-formant stimulus to a single-formant
stimulus whose center frequency was equal
to the frequency of the center of gravity of the
two-formant stimulus. Subsequent work by others (Beddor
and Hawkins, 1990) has led to a better understanding
of the factors that govern the “perceptual
formant” effect, but the existence of the effect itself
is generally accepted. Chistovich suggests that the
auditory system performs a spatial (i.e., spectral)
integration over wide intervals of the cochlea, which
leads to the “perceptual formant” effect. Using
steady-state vowel sounds, she demonstrated that
this model predicted listener performance in vowel
recognition experiments very well. As explained earlier,
the IWAIF of a signal is the “center of gravity”
of the positive-frequency half of the spectrum; hence
the “perceptual formant” should be at the IWAIF
frequency. In Figure 3 we compare the results of
Chistovich for the frequency of the “perceptual formant”
that listeners matched for two-formant vowels
with the IWAIF. As can be seen, the agreement
is quite good.
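
To make the center-of-gravity prediction concrete, the sketch below computes the IWAIF of an idealized two-formant stimulus as a function of the level difference between the formants. It is a deliberate simplification: each formant is treated as a single spectral line carrying all of that formant's energy, and the formant frequencies are hypothetical rather than taken from Chistovich's stimuli.

```python
def two_formant_cog(f1, f2, level_diff_db):
    """Energy-weighted mean frequency of two spectral lines.

    f1, f2        : formant frequencies in Hz
    level_diff_db : level of formant 2 minus level of formant 1, in dB
    """
    e1 = 1.0
    e2 = 10.0 ** (level_diff_db / 10.0)   # dB difference converted to an energy ratio
    return (e1 * f1 + e2 * f2) / (e1 + e2)


# Hypothetical formant pair, swept over level difference.
for diff in (-10.0, 0.0, 10.0):
    print(diff, round(two_formant_cog(1000.0, 1400.0, diff), 1))
```

Equal levels place the predicted match midway between the formants; raising one formant's level moves the prediction toward that formant.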

Application of the Multichannel IWAIF
Model to Mixed Modulation Detection

Figure 3: This figure compares the IWAIF frequency
of synthetic two-formant stimuli (solid line) with the
center frequency of a synthetic one-formant stimulus
that Chistovich’s listeners matched to the two-formant
stimulus. The abscissa is the difference in level
between the formants in the two-formant stimuli.

The human auditory system appears to use modulation
as an important cue in grouping the separate
components of a signal. We can easily detect
both amplitude and frequency modulation, and
combinations of both, called mixed modulation (MM).
Psychoacoustic studies suggest that fairly complex
perception mechanisms, which depend on the type
and frequency of modulation (Ozimek and Sek, 1987;
Moore and Sek, 1992), are involved in detecting MM
signals. A number of models for modulation detection
have been proposed (Hartmann, 1982, Moore,
1989, Zwicker 1%2 and Florentine 1981), but each
is unable to account for some of the psychoacoustic
data.

Following Ozimek and Sek (1987) and Hartmann
and Hnath (1982), the MM signal can be written as:

$$a(t) = A_0 \left(1 + m \cos \omega_m t\right) \sin\left(\omega_0 t + \beta \sin(\omega_m t + \phi)\right) \tag{3}$$

where
$A_0$ = carrier amplitude, $\omega_m$ = modulation angular
frequency, $\omega_0$ = carrier angular frequency,
$\Delta\omega$ = maximum frequency deviation, $m$ = AM modulation
index, $\beta = \Delta\omega / \omega_m$ = FM modulation index,
and $\phi$ = relative phase angle between AM and FM.
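
A mixed modulation waveform of this form can be generated directly from the parameter definitions. The sketch below is illustrative only; the function name, sampling rate and duration are assumptions, while the carrier frequency, modulation frequency and modulation indices are those quoted for Figure 4.

```python
import math


def mm_signal(a0, m, beta, f0, fm, phi_deg, fs, n):
    """Samples of a0*(1 + m*cos(wm*t)) * sin(w0*t + beta*sin(wm*t + phi)).

    a0      : carrier amplitude
    m       : AM modulation index
    beta    : FM modulation index
    f0, fm  : carrier and modulation frequencies in Hz
    phi_deg : relative phase between AM and FM, in degrees
    fs, n   : sampling rate in Hz and number of samples
    """
    phi = math.radians(phi_deg)
    w0 = 2.0 * math.pi * f0
    wm = 2.0 * math.pi * fm
    samples = []
    for k in range(n):
        t = k / fs
        envelope = a0 * (1.0 + m * math.cos(wm * t))
        phase = w0 * t + beta * math.sin(wm * t + phi)
        samples.append(envelope * math.sin(phase))
    return samples


# 10 ms of signal at a hypothetical 16 kHz sampling rate.
x0 = mm_signal(1.0, 0.1, 0.1, 1000.0, 256.0, 0.0, 16000.0, 160)
x90 = mm_signal(1.0, 0.1, 0.1, 1000.0, 256.0, 90.0, 16000.0, 160)
print(len(x0), max(abs(v) for v in x0))
```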

Assuming that $m\beta \ll 1$, the spectrum of an MM
signal consists of three components, the central one of
which represents the carrier, while the sidebands are
due to the modulation. The amplitudes and
phases of the sidebands depend on the phase shift
between the signals that modulate the amplitude and
frequency of the carrier, and on the relative levels of AM
and FM.

The insets in Figure 4 show the waveforms (top)
and schematics of the spectral magnitude (bottom)
of mixed modulation signals with different relative
phase angles $\phi$ between AM and FM. Other parameters
of these signals are: carrier frequency = 1000
Hz, modulating frequency = 256 Hz, m = 0.1 and
$\beta$ = 0.1. Also shown in Fig. 4 are the multi-channel
IWAIF values for the carrier alone (filled circles) and
the MM signal (open squares). Only those channels
whose intensity is within 35 dB of the maximum are
shown in the figure.

Fig. 4 clearly shows how the phase angle $\phi$ affects
the degree of asymmetry of the MM spectra, which
is in turn reflected in the multi-channel IWAIF. As
an example, for the case $\phi = 0$ the amplitude of the upper
sideband is higher than the amplitude of the
lower sideband. Consequently, the IWAIF values at
the output of the channels centered at frequencies
higher than the carrier frequency (high frequency
side channels) are perturbed further from the carrier
frequency than the IWAIF values for the low
frequency side channels. A similar explanation applies
to the other cases of $\phi$.
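
The sideband asymmetry can be checked with a first-order expansion of the MM signal. The closed form used below is our own small-index approximation (first order in the FM index, neglecting terms of order m times beta), not a formula taken from the paper: the upper and lower sideband amplitudes are (A0/2) sqrt(m^2 + beta^2 +/- 2 m beta cos(phi)). The function name is hypothetical.

```python
import math


def sideband_amplitudes(a0, m, beta, phi_deg):
    """First-order (lower, upper) sideband amplitudes of a mixed modulation signal.

    Valid only for small modulation indices; higher-order sidebands are ignored.
    """
    phi = math.radians(phi_deg)
    cross = 2.0 * m * beta * math.cos(phi)
    base = m * m + beta * beta
    # max(..., 0.0) guards against tiny negative values from rounding.
    upper = 0.5 * a0 * math.sqrt(max(base + cross, 0.0))
    lower = 0.5 * a0 * math.sqrt(max(base - cross, 0.0))
    return lower, upper


# Parameters quoted for Figure 4: m = 0.1, beta = 0.1.
for phi_deg in (0.0, 90.0, 180.0, 270.0):
    lower, upper = sideband_amplitudes(1.0, 0.1, 0.1, phi_deg)
    print(phi_deg, round(lower, 4), round(upper, 4))
```

Under this approximation the upper sideband dominates at a relative phase of 0 degrees, the lower sideband dominates at 180 degrees, and the two are equal at 90 and 270 degrees.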

To derive quantitative results comparing the predictions
of the multi-channel IWAIF model to listener
performance in detecting MM signals, it is necessary
to choose an appropriate detection model. We
have adapted the multi-channel detector proposed
by Durlach et al. (1986) for this purpose. The detector
model is based on the following assumptions: (i)
the signal is detected by the changes in the IWAIF
values it produces in different channels; and (ii) internal
noise is present in each channel, and is added
after IWAIF computation, i.e., it serves to perturb
the computed IWAIF values. The noise is assumed
to be zero-mean, Gaussian, statistically independent
across channels, and independent of the
stimulus. The variance of the noise is assumed to be
frequency-dependent.

Under these assumptions, following Durlach et al.,
the sensitivity d' is given by

$$d' = K \left[ \sum_{i=1}^{N_m} \left( \frac{f_1(i) - f_2(i)}{\sigma_F(i)} \right)^{2} \right]^{1/2} \tag{4}$$

where $f_1(i)$ and $f_2(i)$ are the IWAIF values in the
ith frequency channel for the signals to be discriminated,
$\sigma_F(i)$ is the noise standard deviation for
the ith frequency channel, $N_m$ represents the number
of channels of interest with sufficiently high
intensity, and K is a free parameter that represents
the efficiency of the detection mechanism. Since the
IWAIF is basically a frequency value, $\sigma_F(i)$ can be
chosen as the frequency difference limen (FDL) at the
center frequency of the ith channel and the duration of
interest. We used the FDL data available in (Moore,
1974) for different center frequencies and a tone duration
of 100 ms to obtain the values of $\sigma_F(i)$. Finally,
the parameter K is estimated by matching the prediction
of the model to published FM detection data
in (Moore and Sek, 1992).

Figure 4: The waveform (top inset), spectral magnitude
(bottom inset), and multi-channel IWAIF of the
MM signal (open squares) with different relative
phase angles ($\phi$) between AM and FM. Also shown
are the multi-channel IWAIF values of the carrier
alone (filled circles).
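
The detection stage described above can be sketched as follows. The combination rule is an assumption on our part: for independent Gaussian channel noise, the per-channel IWAIF differences, each normalized by its noise standard deviation, are combined as a root sum of squares and scaled by the efficiency parameter K. The function name and all numerical values are hypothetical; in the paper the standard deviations come from frequency difference limen data.

```python
import math


def d_prime(f1, f2, sigma_f, k=1.0):
    """Sensitivity from per-channel IWAIF differences.

    f1, f2  : IWAIF values (Hz) per retained channel for the two signals
    sigma_f : internal-noise standard deviation (Hz) per channel
    k       : efficiency parameter (free parameter of the model)
    """
    if not (len(f1) == len(f2) == len(sigma_f)):
        raise ValueError("all inputs must have one entry per channel")
    total = sum(((a - b) / s) ** 2 for a, b, s in zip(f1, f2, sigma_f))
    return k * math.sqrt(total)


# Hypothetical two-channel example.
carrier = [1000.0, 1000.0]
modulated = [1003.0, 1004.0]
sigma = [1.0, 1.0]
print(d_prime(carrier, modulated, sigma))
```

Channels in which the two signals produce nearly identical IWAIF values contribute little to the sum, so sensitivity is driven by the channels whose IWAIF is perturbed most relative to the local noise.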


Moore and Sek (1992) investigated the amount
of FM (at a fixed phase angle $\phi$) that needs to be
added to a signal containing a fixed, “sub-threshold”
amount of AM until the resulting MM signal is just
discriminable from the carrier signal alone. Listener
performance on this task changes considerably
depending on the modulating frequency. At low modulation
frequencies (4 Hz-16 Hz), the amount of FM
needed for detection is nearly independent of the
phase. For a modulation frequency of 256 Hz, on
the other hand, the amount of FM needed for detection
shows strong phase effects.

Figure 5 compares listener performance (open
symbols) with the model prediction (filled symbols)
for FM frequencies of 256 Hz and 4 Hz. The data
show that at a modulation frequency of 256 Hz there were very large
effects of the relative phase $\phi$. For $\phi = 0^\circ$ (circle
symbol), i.e., when the maxima in amplitude and
frequency were coincident, the coexisting AM made
the FM harder to discriminate from the carrier, i.e.,
an increase in AM depth (m) caused an increase in
the FM index ($\beta$) required for threshold. The opposite
effect was observed for $\phi = 180^\circ$, when maxima in amplitude
and minima in frequency were coincident:
an increase in m caused a significant decrease in the $\beta$
required for threshold. For $\phi = 90^\circ$ or $270^\circ$, the value
of $\beta$ required for threshold decreased slightly with
increasing m. This supports the idea that the spectral
structure of the modulated signal and the frequency
selectivity of the auditory system are the bases for
the discrimination, and that the temporal fine structure
(i.e., changes in frequency and amplitude over time)
does not play any role in the detection process. The
multichannel IWAIF predictions in this case agree
quite well with the data.

The results at the lowest modulation frequency
(4 Hz) are also shown in Fig. 5. Here the listener
data (open circles) show no clear effect of the relative
phase $\phi$. This result was tested with the multi-channel
IWAIF model. The model predicts an effect
of the relative phase that was not observed in the
data; at the same time, for $\phi = 90^\circ$ and $270^\circ$, the
model predicts that less FM
is needed than the listener data show.

Figure 5: Comparison of listener performance with
multi-channel IWAIF model predictions for the
discrimination of an MM signal from the carrier. The
modulation frequency is 256 Hz for the top graph
and 4 Hz for the bottom graph.


Discussion 


The multi-channel IWAIF model predicts listener
detection of MM signals at high modulation rates quite
well, but fails at the lower modulation frequencies.
This is also true of other models, such as the one
proposed by Hartmann and Hnath (1982). A common
feature of both these models is that they use only
the spectral properties of the signal and ignore the
temporal structure. We speculate that at very low
modulation rates (4 Hz), when listeners can easily
follow the temporal structure of the signal, detection
is determined by temporal properties. At moderate
modulation rates (16 Hz-64 Hz), perhaps a combination
of temporal and spectral effects contributes to
detection. We believe that a short-term IWAIF model
may be more applicable at the low modulation rates
(Krishnamurthy and Feth, 1993).

Conclusions 

We have presented a multi-channel IWAIF model
that is applicable to wideband signals and incorporates
basilar membrane filtering and spatial integration.
An application of the model to the detection
of mixed modulation was described. The results
indicate that the model matches listener performance
at high modulation rates. We plan to extend the
model to include more stages of auditory processing,
such as temporal integration. We will also combine
the multi-channel IWAIF and the short-term IWAIF
models, leading to a model that is useful for
time-varying, broadband signals.